A Short Self-Paced Spark Course

Watch the video

This is a 10-hour course that teaches the basics of Spark to students who are already familiar with coding and machine learning.

All examples are given in Python.

Read more

Go to the companion repository of the book "Spark: The Definitive Guide": https://github.com/databricks/Spark-The-Definitive-Guide

Download the whole repository. It also contains the data used in the examples.

References

[SDG] Spark: The Definitive Guide (book)
[2] https://luminousmen.com/post/spark-tips-partition-tuning

Running the example code

Either use the free Community Edition of Databricks (https://community.cloud.databricks.com/)

or run locally on your own machine (instructions are provided for Linux, Windows, and Mac).

Focus of this course

  • understand the concepts
  • practice simple operations
  • get basic familiarity with configuration and tuning
  • run simple machine learning models

What is Spark?

What are horizontal scaling and vertical scaling?

Apache Spark is an open-source cluster computing framework.

Designed as a faster alternative to Hadoop MapReduce. It can run on a Hadoop cluster (YARN, HDFS) but does not require one.

Uses in-memory computing.

Originally developed at UC Berkeley (2009).
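
To make the idea of horizontal scaling concrete, here is a minimal sketch in plain Python (not Spark itself). It splits a dataset into partitions, computes a partial result for each partition independently, and then combines the partial results. This is roughly the pattern that a cluster framework spreads over many machines. The function names and the number of workers are illustrative assumptions, and the worker threads only stand in for cluster machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions roughly equal, contiguous chunks."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions get one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def partial_sum_of_squares(partition):
    """Work done independently on one partition (one 'machine')."""
    return sum(x * x for x in partition)


def distributed_sum_of_squares(data, num_workers):
    """Scale 'horizontally': more workers, each handling a smaller partition."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(partial_sum_of_squares, partitions))
    # Combine the partial results into the final answer.
    return sum(partial_results)


numbers = list(range(1, 101))
result = distributed_sum_of_squares(numbers, num_workers=4)
print(result)  # 338350
```

Vertical scaling, by contrast, would keep a single worker and give it a faster CPU or more memory instead of adding workers.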

Where to run my Spark server?

In a real production environment, you can use a managed cluster in the cloud, such as Databricks or Microsoft HDInsight. You can also install your own Spark cluster, locally or in the cloud. A cluster can contain thousands of machines.

During this course we will use a minimal installation on your own PC, Mac, or Linux machine.

Instructions are here: https://github.com/cnoam/spark-course/blob/master/readme.md
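
Once the local installation is done, a first session might look like the sketch below. It assumes that the `pyspark` package and a Java runtime are available. The helper names `local_master_url` and `run_local_demo`, the application name, and the row count are illustrative choices, not part of the course material.

```python
def local_master_url(cores=None):
    """Build a Spark master URL for local mode.

    'local[*]' uses all available cores; 'local[N]' uses N worker threads.
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return f"local[{cores}]"


def run_local_demo(cores=None):
    """Start a local SparkSession, count a small DataFrame, and shut down."""
    # Imported here so the helper above works even without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(local_master_url(cores))
        .appName("spark-course-demo")  # hypothetical application name
        .getOrCreate()
    )
    try:
        # spark.range(1000) creates a DataFrame with ids 0..999.
        return spark.range(1000).count()
    finally:
        spark.stop()
```

Calling `run_local_demo(2)` runs Spark inside a single process with two worker threads, so no cluster is needed to follow the examples.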

Where are the video recordings?

https://panoptotech.cloud.panopto.eu/Panopto/Pages/Sessions/List.aspx?folderID=a2ea87f6-ac49-4444-b9bd-afa800a4f0c3

While developing: on my laptop "~/videos/spark videos"

Check yourself

Occasionally, you will have opportunities to check your knowledge. Try to answer/solve/execute all the questions. They will help you make sure you are ready for the next part!

  • Explain the difference between horizontal and vertical scaling
  • Look up the definition of "cluster". Does it match what we have in Spark?

Technical details on recording: Ubuntu 22.04, Kazam full-screen recording. Use Extension Manager to hide the desktop icons.